If you throw a 100-page PDF uncut into a vector database, you shouldn't be surprised if the retrieval system can't find the needle in the haystack.
Here's what we've learned from building our own RAG chatbot - and why we've abandoned the standard approach.
The mistake: Fixed-size chunking
The simplest approach is "fixed-size chunking": you stubbornly cut the text every 500 tokens (perhaps with a small overlap). The problem? Sentences and logical connections get cut up mercilessly.
Chunk A: "...the price of the product is"
Chunk B: "50 euros, but only if you buy..."
The result: the semantic meaning is lost at the cut (semantic drift). The vector search no longer finds the context.
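This failure mode is easy to reproduce. A minimal sketch of fixed-size chunking (character-based rather than token-based for simplicity; `fixed_size_chunks` is an illustrative helper, not a library function):

```python
# Illustrative sketch of naive fixed-size chunking (character-based for
# simplicity; production pipelines usually count tokens instead).

def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut the text every `size` characters, ignoring sentence boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "The price of the product is 50 euros, but only if you buy the annual plan."
chunks = fixed_size_chunks(text, size=28)
# The cut lands mid-sentence: "50 euros" ends up separated from "the price".
```

With an embedding per chunk, neither vector now represents the full statement "the price is 50 euros" - exactly the Chunk A / Chunk B situation above.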
The game changer: semantic chunking & context awareness
We now rely on semantic chunking. We analyze the structure of the document (paragraphs, breaks) before we cut.
An example of such a text splitter is LangChain's RecursiveCharacterTextSplitter. Depending on the content, there are others, such as CodeSplitter for processing source code in various programming languages, or SemanticChunker, which compares the similarity between sentences using embeddings: when the topic represented by the embeddings changes, a new chunk begins.
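The core idea behind recursive splitting: try the coarsest separator first (paragraph breaks), and only fall back to finer ones (lines, sentences, words) for pieces that are still too long. A minimal pure-Python sketch of that idea, under the assumption of character-based lengths (`recursive_split` is an illustrative helper, not the LangChain implementation):

```python
# Sketch of recursive, structure-aware splitting: split on the coarsest
# separator first, recurse with finer separators only where needed, then
# greedily merge adjacent pieces back together up to max_len.

def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_len:
        return [text]
    if not separators:
        # last resort: hard character cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) > max_len:
            pieces.extend(recursive_split(part, max_len, rest))
        else:
            pieces.append(part)
    # merge adjacent pieces greedily so chunks stay as large as allowed
    chunks, buf = [], ""
    for p in pieces:
        cand = (buf + sep + p) if buf else p
        if len(cand) <= max_len:
            buf = cand
        else:
            chunks.append(buf)
            buf = p
    if buf:
        chunks.append(buf)
    return chunks

text = "Para one sentence A. Para one sentence B.\n\nPara two is here."
chunks = recursive_split(text, max_len=45)
# Each paragraph stays intact; the split falls on the paragraph break.
```

With LangChain itself, the equivalent call would be roughly `RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_text(text)`.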
There are also paid APIs that provide specialized splitters for various use cases. Examples include LlamaIndex, Jina AI and Unstructured.
But the real pro tip, which has massively increased quality for us: Header Enrichment.
A single paragraph ("The notice period is 3 months") is useless if you don't know what it refers to.
To optimize the process, each chunk can be enriched with metadata: the language of the content, keywords, years, categories, regional references or author information, for example. This metadata is then included in the search. Metadata enrichment makes a big difference to result quality, especially with very large collections of documents that are similar in content.
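A minimal sketch of Header Enrichment for a markdown-like document (an assumed input format; `enrich_with_headers` is an illustrative helper, not a library function). Each chunk keeps the heading path it appeared under, both as metadata and prepended to the text that gets embedded:

```python
# Sketch: carry the section-heading path into every chunk, so a paragraph
# like "The notice period is 3 months" stays attached to its context.

def enrich_with_headers(document: str) -> list[dict]:
    chunks, headers, buf = [], {}, []

    def flush():
        if buf:
            path = " > ".join(headers[k] for k in sorted(headers))
            body = " ".join(buf)
            chunks.append({
                "text": f"[{path}] {body}" if path else body,
                "metadata": {"headers": dict(headers)},
            })
            buf.clear()

    for line in document.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            # a new section invalidates deeper headers, keeps parent ones
            headers = {k: v for k, v in headers.items() if k < level}
            headers[level] = line.lstrip("# ").strip()
        elif line.strip():
            buf.append(line.strip())
        else:
            flush()  # blank line ends the current chunk
    flush()
    return chunks

doc = "# Contract\n## Termination\nThe notice period is 3 months.\n"
chunks = enrich_with_headers(doc)
```

The embedded text is now "[Contract > Termination] The notice period is 3 months." instead of a free-floating sentence - and the header dict can additionally be used as a metadata filter at query time.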
Additional metadata makes a particularly big difference with PDFs, since converting them into structured text is hard, and extraction, even with specialized tools such as Kreuzberg, Docling or Unstructured, often does not deliver satisfactory results.
Conclusion
Good RAG does not start with prompting, but with data preparation. If you skimp on chunking, you will pay for it later with hallucinations.
How do you deal with long documents? Fixed, semantic or recursive chunking?
#RAG #DataEngineering #LLM #LangChain #KnowledgeManagement #TechDeepDive #AgencyLife